Explore and summarize Red Wine data by Sai Raj Reddy
Dataset Introduction
This is a clean, wrangled dataset based on research by Cortez et al. (2009), used here to explore what determines the quality of red wine. The dataset covers the red variant of the Portuguese “Vinho Verde” wine. Its physicochemical features are treated as inputs that determine wine quality, which is recorded as a sensory rating.
str(red_wine_data)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Univariate Plots Section

- 8.299 g/dm^3 is the average fixed acidity.
- The distribution is slightly positively skewed, so there must be outliers (large values of fixed acidity) in the data.
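One way to check the direction of skew is to compare the mean with the median and to compute a skewness coefficient. The sketch below uses only the ten fixed acidity values visible in the str() preview, so it illustrates the check rather than the full-data result; `skewness` is a hypothetical helper written for this example, not a function from the original analysis.

```r
# First ten fixed.acidity values, copied from the str() preview.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# A positive value means a longer tail toward large values.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

mean_fa <- mean(fixed_acidity)
median_fa <- median(fixed_acidity)
skew_fa <- skewness(fixed_acidity)

print(c(mean = mean_fa, median = median_fa, skewness = skew_fa))
```

A mean that sits above the median, together with a positive coefficient, points to a tail of large values. On the full data the same helper could be applied to red_wine_data$fixed.acidity.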

- Chlorides indicate the amount of salt present in the wine. The distribution is highly right-skewed: most values are concentrated on the left, and there are many outliers in the long right tail.
- These outlier values reach up to 0.64 g/dm^3 of chloride.
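Outliers of this kind can be listed with the same rule a box plot uses (points more than 1.5 times the interquartile range beyond the hinges). The sketch below applies base R's `boxplot.stats` to the ten chloride values from the str() preview only; the choice of this rule is an assumption, since the original does not say how outliers were identified.

```r
# First ten chlorides values, copied from the str() preview.
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076,
               0.075, 0.069, 0.065, 0.073, 0.071)

# boxplot.stats() returns the five-number summary in $stats and the
# points lying beyond 1.5 * IQR from the hinges in $out.
chloride_stats <- boxplot.stats(chlorides)
chloride_outliers <- chloride_stats$out

print(chloride_outliers)
```

With the full data frame, the same call on red_wine_data$chlorides would list every point drawn beyond the whiskers.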

- Density is approximately normally distributed.
- The mean and median are both about 0.99 g/cm^3, which is consistent with a symmetric distribution.

- pH values are also approximately normally distributed.
- Most red wines have a pH between 3 and 4.

- Interestingly, very few wines have a high percentage of alcohol.

- The new total acidity variable confirms that volatile acidity is consistently small across the wines.
- The variation in total acidity is similar to the variation in fixed acidity, since volatile acidity is very small in all the wines.
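A derived variable like this can be built with a single column assignment. The sketch below assumes total acidity is the sum of fixed and volatile acidity (the original does not state the formula, and it may also have included citric acid); it uses the first five rows of the str() preview, and `wine` and `volatile_share` are hypothetical names.

```r
# First five rows of the two acidity columns, from the str() preview.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70)
)

# Assumed definition: total acidity = fixed + volatile acidity.
wine$total.acidity <- wine$fixed.acidity + wine$volatile.acidity

# Share of the total contributed by volatile acidity in each wine.
volatile_share <- wine$volatile.acidity / wine$total.acidity

print(wine)
print(round(volatile_share, 3))
```

Because the volatile share is small in each of these rows, total acidity tracks fixed acidity closely, which is the pattern the bullets describe.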

- Volatile acidity is slightly skewed.

- Most wines contain little citric acid.


- Let’s see how this varies for total sulfur dioxide.

- The distribution is not markedly different from that of free sulfur dioxide.

Univariate Analysis

- The output variable of this dataset is ‘quality’.
- This gives us the overall picture of how the input can affect the quality.
- Therefore quality can be considered as the basis for any multivariate or bivariate analysis from this point.
What is the structure of your dataset?
- The dataset contains 1599 rows and 13 columns.
- The columns are a row index (X), 11 input variables and 1 output variable (quality). All the variables are numeric.
What is/are the main feature(s) of interest in your dataset?
- The output variable ‘quality’ which is based on the sensory data.
- The most interesting features to explore are pH, alcohol and acidity.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
- Sulphates and density seem interesting and are a good fit for further analysis.
Did you create any new variables from existing variables in the dataset?
- Yes, I created a new variable for total amount of acidity.
- The new total acidity variable confirms that volatile acidity is consistently small across the wines.
- The variation in total acidity is similar to the variation in fixed acidity, since volatile acidity is very small in all the wines.
Bivariate Plots Section

- This shows the correlation between pH and density.
- It indicates an inverse relationship: density tends to fall as pH rises.

- This shows the relationship between pH and fixed acidity.
- They are inversely related: the higher the fixed acidity, the lower the pH.

- As expected, all wines contain some amount of alcohol.
- We can observe that the higher the quality of the wine, the higher its alcohol content.
- The box plots show how the typical alcohol content rises as the quality increases.
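The trend read off the box plots can also be checked numerically by taking the median alcohol content per quality level. This is a minimal sketch in base R using only the ten rows visible in the str() preview, which is far too small a sample to reproduce the trend reliably; `preview` is a hypothetical name.

```r
# First ten rows of alcohol and quality, from the str() preview.
preview <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# tapply() splits alcohol by quality level and applies median()
# to each group; the result is named by quality level.
median_alcohol <- tapply(preview$alcohol, preview$quality, median)

print(median_alcohol)
```

On the full data, the same call with red_wine_data$alcohol and red_wine_data$quality gives one median per quality score, matching the centre lines of the box plots.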

- This is not a very strong relationship, and the residual sugar values are heavily skewed.
- I tried this because I wanted to see how alcohol relates to the residual sugar left in the wine.

- The correlation is weak.
- Acidity doesn’t correlate much with alcohol.

- There is a moderate relationship between sulphate content and alcohol content.
- In the ideal case it appears to be an inverse relationship.

- There is an interesting pattern here: pH values vary slightly with alcohol content.

- Citric acid is a weak organic acid, but it still contributes to the acidity of wine.
- There’s a good correlation here, indicating how citric acid affects the pH.

- This also shows a moderate correlation between sulphates and pH.
- It’s interesting to see that sulphates affect only the mid-range pH values (3 - 3.5).
Bivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
- We can observe that the higher the quality of the wine, the higher its alcohol content.
- The box plots show how the typical alcohol content rises as the quality increases.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
- Yes, the relationship between pH and density. They are inversely related.
What was the strongest relationship you found?
- The relationship between pH and fixed acidity.
- They are inversely related: the higher the fixed acidity, the lower the pH.
Multivariate Plots Section

- The low-rated wines are found mostly in the high-density, low-alcohol area, i.e. the top-left.
- The high-rated wines are concentrated in the low-density, high-alcohol area, i.e. the bottom-right.
- Looking at the regression lines for each quality level, the lines for the lower quality categories sit further to the left and have steeper slopes.
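Per-quality regression lines like these can be compared by fitting one linear model per quality level and reading off the slopes. The str() preview does not show enough density values to do this on real rows, so the sketch below runs on simulated data: the `sim` data frame, its coefficients and its noise level are all assumptions made for illustration and say nothing about the actual wines.

```r
set.seed(42)

# Simulated stand-in for the wine data (NOT the real dataset).
n <- 60
sim <- data.frame(
  quality = rep(c(5, 6, 7), each = n / 3),
  alcohol = runif(n, min = 9, max = 13)
)

# Assumed toy relationship: density falls as alcohol rises.
sim$density <- 1.002 - 0.0005 * sim$alcohol + rnorm(n, sd = 0.0002)

# Fit density ~ alcohol separately within each quality level
# and keep the slope of each fitted line.
slopes <- sapply(split(sim, sim$quality), function(group) {
  coef(lm(density ~ alcohol, data = group))[["alcohol"]]
})

print(slopes)
```

Replacing `sim` with red_wine_data would give one slope per quality score, so the steepness of the lines could be compared directly instead of by eye.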

- The higher the alcohol content, the lower the density of the wine.
- Using total acidity as the color scale on the alcohol vs. density graph, we see that total acidity is higher in the wines with high density.
- It is also spread across wines with various levels of alcohol.

- pH accounts for little of the variation here.

- Citric acid correlates strongly with density, and it varies noticeably across the density range.

- The strongly correlated points of density vs alcohol have low residual sugars left.
- This shows how the correct combination of density and alcohol can lead to low residual sugar.
Multivariate Analysis
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
- Yes, we can observe that in the second plot.
- The higher the alcohol content, the lower the density of the wine.
- Using total acidity as the color scale on the alcohol vs. density graph, we see that total acidity is higher in the wines with high density.
- It is also spread across wines with various levels of alcohol.
Were there any interesting or surprising interactions between features?
- Yes, the first plot is interesting because it shows that the lower quality categories have steeper slopes.
- The low-rated wines are found mostly in the high-density, low-alcohol area, i.e. the top-left.
- The high-rated wines are concentrated in the low-density, high-alcohol area, i.e. the bottom-right.
Final Plots and Summary
Plot One

Description One
- The output variable of this dataset is ‘quality’.
- This gives us the overall picture of how the input can affect the quality.
- Therefore quality can be considered as the basis for any multivariate or bivariate analysis from this point.
Plot Two

Description Two
- This shows the relationship between pH and fixed acidity.
- The two variables have an inverse relationship: the higher the pH, the lower the fixed acidity, and vice versa.
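The direction of this relationship can be confirmed with a correlation coefficient. The sketch below uses only the ten pH and fixed acidity values shown in the str() preview, so the size of the coefficient is illustrative; only its sign is the point here.

```r
# First ten rows of fixed.acidity and pH, from the str() preview.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
ph <- c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)

# cor() computes the Pearson correlation by default;
# a negative value indicates an inverse relationship.
r <- cor(ph, fixed_acidity)

print(r)
```

Running the same call on red_wine_data$pH and red_wine_data$fixed.acidity gives the coefficient for all 1599 wines.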
Plot Three

Description Three
- The low-rated wines are found mostly in the high-density, low-alcohol area, i.e. the top-left.
- The high-rated wines are concentrated in the low-density, high-alcohol area, i.e. the bottom-right.
- Looking at the regression lines for each quality level, the lines for the lower quality categories sit further to the left and have steeper slopes.
Reflection
Exploring the Red Wine dataset helped me understand the relationships between the features, especially those between pH and density, and between alcohol, acidity and quality.
Since most of the wines in the data are of medium quality, the strength of the inferences may be weakened. More data, preferably 50,000+ rows of sensory-rated wine values, should support stronger inference.
Regression analysis of quality is easier with a larger amount of data.
Future work could explore how features like sulfur dioxide and chlorides also affect quality. This needs a deeper understanding of how the physicochemical factors determine quality, especially for the multivariate plots.